This document explains how to performs ionomics data analysis including gene network and enrichment analysis.
To explore the pipeline, we’ll use the ionomics data set:
ion_data <- read.table("./test-data/iondata.tsv", header = T, sep = "\t")
dim(ion_data)
#> [1] 9999 16
Ten random lines are shown as:
sample_n(ion_data, 10)
| Knockout | Batch_ID | Ca | Cd | Co | Cu | Fe | K | Mg | Mn | Mo | Na | Ni | P | S | Zn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| YDR116C | 9 | 20.22 | 0.97 | 0.22 | 1.66 | 13.40 | 2667.78 | 427.54 | 1.08 | 0.53 | 198.37 | 1.50 | 2768.73 | 382.55 | 14.74 |
| YKL126W | 27 | 30.63 | 1.00 | 0.14 | 1.32 | 30.07 | 2453.94 | 713.85 | 1.26 | 0.80 | 190.05 | 1.09 | 4546.71 | 450.01 | 16.24 |
| YDL227C | 89 | 32.20 | 1.03 | 0.15 | 1.51 | 8.45 | 2515.41 | 745.91 | 1.06 | 1.05 | 213.89 | 1.39 | 4668.83 | 465.43 | 12.98 |
| YDR101C | 9 | 27.39 | 0.74 | 0.18 | 1.74 | 6.46 | 3096.10 | 435.05 | 1.21 | 0.49 | 145.92 | 1.05 | 2813.35 | 441.19 | 14.18 |
| YDL227C | 44 | 47.31 | 1.04 | 0.18 | 2.19 | 12.41 | 3247.68 | 787.86 | 1.53 | 1.88 | 248.47 | 1.48 | 4953.02 | 538.15 | 20.59 |
| YDL227C | 1 | 75.23 | 0.73 | 0.16 | 2.17 | 15.47 | 3145.72 | 531.51 | 1.45 | 1.55 | 221.13 | 2.07 | 3806.32 | 809.14 | 16.49 |
| YHR195W | 20 | 46.38 | 0.95 | 0.22 | 1.97 | 10.43 | 2714.71 | 671.66 | 1.41 | 1.06 | 254.95 | 1.66 | 4368.57 | 841.84 | 15.43 |
| YDL227C | 79 | 29.55 | 0.88 | 0.12 | 1.25 | 4.33 | 2391.78 | 581.71 | 1.02 | 1.10 | 206.23 | 0.78 | 4403.04 | 278.93 | 9.70 |
| YLR396C | 77 | 62.40 | 0.97 | 0.15 | 1.25 | 7.34 | 1930.18 | 718.25 | 0.86 | 0.88 | 122.54 | 1.36 | 4656.30 | 795.09 | 22.06 |
| YDL227C | 60 | 45.64 | 0.88 | 0.18 | 1.55 | 8.45 | 2452.21 | 726.69 | 1.29 | 1.44 | 254.35 | 1.33 | 4530.69 | 561.93 | 18.89 |
We can see that the first few columns are meta information such as gene ORF and batch id. The rest is the ionomics data.
The raw data set is needed to be pre-processed. This involves:
For batch correction, control line could be used. If so, the values belong to control lines are used to be the basis of batch correlation. This data has a control line: YDL227C mutant. The code segment below is to identify it:
max(with(ion_data, table(Knockout)))
#> [1] 1617
which.max(with(ion_data, table(Knockout)))
#> YDL227C
#> 209
Standarisation provides a custom method. This allows user to use specific std values such as:
std <- read.table("./test-data/user_std.tsv", header = T, sep = "\t")
std
#> Ion sd
#> 1 Ca 0.1508
#> 2 Cd 0.0573
#> 3 Co 0.0580
#> 4 Cu 0.0735
#> 5 Fe 0.1639
#> 6 K 0.0940
#> 7 Mg 0.0597
#> 8 Mn 0.0771
#> 9 Mo 0.1142
#> 10 Na 0.1075
#> 11 Ni 0.0784
#> 12 P 0.0597
#> 13 S 0.0801
#> 14 Zn 0.0671
The outlier detection here is univarite method, with a threshold to control
the number of outliers. The larger the threshold (thres_outl) the more
outlier removal.
The pre-process procedure returns not only processed ionomics data but also
a symbolic data. This data is based on the inomics data and a
threshold(thres_symb):
0 if ionomics data located between [-thres_symb, thres_symb]1 if ionomics data larger than thres_symb-1 if ionomics data smaller than -thres_symbThs symbolic data is important since the network analysis will use it along with ionomics data, and enrichment analysis will be performd only based on it. It also should be noted that the symblic data is sensitive to the choices of the threshold.
Let’s run the pre-process procedure:
pre <- PreProcessing(data = ion_data,
var_id = 1, batch_id = 2, data_id = 3,
method_norm = "median",
control_lines = "YDL227C",
control_use = "control",
method_outliers = "IQR",
thres_outl = 3,
stand_method = "std",
stdev = NULL,
thres_symb = 3)
names(pre)
#> [1] "stats.raw_data" "stats.outliers" "stats.batch_data"
#> [4] "data.long" "data.gene.logFC" "data.gene.zscores"
#> [7] "data.gene.symb" "plot.dot" "plot.hist"
The pre-processed data are returned with some summaries, one of them is the z-score.
pre$stats.batch_data %>% kable(caption = 'Summary: z-scores', digits = 2) %>%
kable_styling(full_width = F, font_size = 10)
| Ion | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | Variance |
|---|---|---|---|---|---|---|---|
| Ca | -4.45 | -0.28 | -0.13 | -0.12 | 0.02 | 2.35 | 0.11 |
| Cd | -1.70 | 0.03 | 0.10 | 0.11 | 0.17 | 0.93 | 0.03 |
| Co | -2.80 | 0.02 | 0.09 | 0.06 | 0.15 | 1.60 | 0.05 |
| Cu | -0.66 | -0.10 | -0.03 | -0.01 | 0.04 | 5.28 | 0.04 |
| Fe | -7.48 | -0.17 | -0.06 | -0.02 | 0.07 | 6.88 | 0.14 |
| K | -2.21 | -0.17 | -0.01 | -0.08 | 0.09 | 1.83 | 0.08 |
| Mg | -1.84 | -0.06 | 0.01 | -0.01 | 0.07 | 1.69 | 0.03 |
| Mn | -4.11 | -0.24 | -0.08 | -0.13 | 0.01 | 1.78 | 0.06 |
| Mo | -2.03 | -0.26 | -0.08 | -0.08 | 0.09 | 4.44 | 0.13 |
| Na | -7.41 | -0.53 | -0.22 | -0.33 | -0.04 | 1.25 | 0.24 |
| Ni | -2.40 | -0.01 | 0.09 | 0.12 | 0.21 | 7.90 | 0.12 |
| P | -1.18 | -0.06 | 0.00 | -0.01 | 0.06 | 1.45 | 0.02 |
| S | -2.38 | -0.03 | 0.05 | 0.06 | 0.16 | 2.38 | 0.04 |
| Zn | -0.46 | -0.08 | -0.03 | -0.01 | 0.03 | 4.60 | 0.02 |
The pre-processed data and its symbolic data are like like:
pre$data.gene.zscores %>% head() %>%
kable(caption = 'Pre-processed data', digits = 2) %>%
kable_styling(full_width = F, font_size = 10,
latex_options = c("striped", "scale_down"))
| Line | Ca | Cd | Co | Cu | Fe | K | Mg | Mn | Mo | Na | Ni | P | S | Zn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| YAL004W | -1.16 | 0.75 | 1.19 | -0.47 | 0.04 | 0.61 | 0.51 | -0.84 | -0.08 | -1.84 | 1.71 | 0.52 | 0.33 | -0.09 |
| YAL005C | -1.67 | 0.84 | 0.55 | 0.58 | -2.79 | 0.59 | 0.31 | -1.16 | -1.42 | -0.12 | 1.48 | 0.73 | 0.13 | -0.13 |
| YAL007C | -2.12 | 0.64 | 0.23 | -0.53 | -0.24 | 0.79 | -0.09 | -0.14 | 1.22 | -0.92 | 0.00 | 0.09 | -0.29 | -0.65 |
| YAL008W | -2.34 | 1.13 | 0.21 | -0.73 | -2.16 | 0.52 | -0.02 | -0.87 | 0.93 | -0.58 | 0.02 | -0.09 | -0.73 | -0.47 |
| YAL009W | -1.18 | 0.66 | 0.55 | -1.11 | -3.91 | 0.22 | 0.09 | -0.18 | 1.50 | -0.84 | -0.09 | 0.14 | 0.01 | -0.36 |
| YAL010C | -1.28 | 1.43 | 2.27 | 0.46 | 1.53 | -2.75 | 0.04 | -0.74 | -9.71 | -4.30 | 2.42 | -0.98 | -0.05 | -0.01 |
pre$data.gene.symb %>% head() %>%
kable(caption = 'Symbolic data') %>%
kable_styling(full_width = F, font_size = 10)
| Line | Ca | Cd | Co | Cu | Fe | K | Mg | Mn | Mo | Na | Ni | P | S | Zn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| YAL004W | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| YAL005C | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| YAL007C | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| YAL008W | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| YAL009W | 0 | 0 | 0 | 0 | -1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| YAL010C | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -1 | -1 | 0 | 0 | 0 | 0 |
The pre-processed data distribution is:
pre$plot.hist
Figure 1: Ionome data distribution plot
There are a lot of ways to filter gene. Here we filter gene based on symbolic data:
data <- pre$data.gene.zscores
data_symb <- pre$data.gene.symb
idx <- rowSums(abs(data_symb[, -1])) > 0
dat <- data[idx, ]
dat_symb <- data_symb[idx, ]
dim(dat)
#> [1] 549 15
The gene network is based on the ionomics and symboloc data and uses the similarity measure results to build up the network. The similarity measure method is one of pearson, spearman, kendall, cosine, mahal_cosine or hybrid_mahal_cosine.
First, the Pearson correlation is used to build up the network:
net <- GeneNetwork(data = dat,
data_symb = dat_symb,
min_clust_size = 10,
thres_corr = 0.75,
method_corr = "pearson")
net$plot.pnet1
Figure 2: Netwok analysis based on Pearson correlation
net$plot.pnet2
Figure 3: Netwok analysis based on Pearson correlation
net$plot.impact_betweenness
Figure 4: Netwok analysis based on Pearson correlation
The node colours are indicated by either the similarity measures or the network community detection, i.e. clustering.
For the comparision, we use different similarity methods. Here use Cosine:
net_1 <- GeneNetwork(data = dat,
data_symb = dat_symb,
min_clust_size = 10,
thres_corr = 0.75,
method_corr = "cosine")
net_1$plot.pnet1
Figure 5: Netwok analysis based on Cosine
net_1$plot.pnet2
Figure 6: Netwok analysis based on Cosine
Use Hybrid Mahalanobis Cosine:
net_2 <- GeneNetwork(data = dat,
data_symb = dat_symb,
min_clust_size = 10,
thres_corr = 0.75,
method_corr = "mahal_cosine")
net_2$plot.pnet1
Figure 7: Netwok analysis based on Mahalanobis Cosine
net_2$plot.pnet2
Figure 8: Netwok analysis based on Mahalanobis Cosine
Again, we use Hybrid Mahalanobis Cosine:
net_3 <- GeneNetwork(data = dat,
data_symb = dat_symb,
min_clust_size = 10,
thres_corr = 0.75,
method_corr = "hybrid_mahal_cosine")
net_3$plot.pnet1
Figure 9: Netwok analysis based on Hybrid Mahalanobis Cosine
net_3$plot.pnet2
Figure 10: Netwok analysis based on Hybrid Mahalanobis Cosine
The KEGG enrichment analysis:
kegg <- kegg_enrich(data = dat_symb, min_clust_size = 10, pval = 0.05,
annot_pkg = "org.Sc.sgd.db")
#' kegg
kegg %>%
kable(caption = 'KEGG enrichmenat analysis', digits = 3) %>%
kable_styling(full_width = F, font_size = 10,
latex_options = c("striped", "scale_down"))
| Cluster | KEGGID | Pvalue | Count | Size | Term |
|---|---|---|---|---|---|
| Cluster 7 (36 genes) | 03010 | 0.029 | 9 | 16 | Ribosome |
| Cluster 7 (36 genes) | 00330 | 0.031 | 3 | 3 | Arginine and proline metabolism |
| Cluster 18 (15 genes) | 00290 | 0.009 | 2 | 2 | Valine, leucine and isoleucine biosynthesis |
| Cluster 18 (15 genes) | 00520 | 0.009 | 2 | 2 | Amino sugar and nucleotide sugar metabolism |
| Cluster 18 (15 genes) | 00260 | 0.012 | 3 | 6 | Glycine, serine and threonine metabolism |
| Cluster 18 (15 genes) | 00010 | 0.024 | 2 | 3 | Glycolysis / Gluconeogenesis |
| Cluster 18 (15 genes) | 01110 | 0.037 | 5 | 22 | Biosynthesis of secondary metabolites |
| Cluster 3 (11 genes) | 00400 | 0.009 | 2 | 2 | Phenylalanine, tyrosine and tryptophan biosynthesis |
| Cluster 8 (11 genes) | 01100 | 0.006 | 6 | 55 | Metabolic pathways |
| Cluster 8 (11 genes) | 00564 | 0.027 | 2 | 6 | Glycerophospholipid metabolism |
Note that there can be none results for KRGG enrichment analysis. Change
arguments such as thres_clus as appropriate.
The GO Terms enrichment analysis:
go <- go_enrich(data = dat_symb, min_clust_size = 10, pval = 0.05,
ont = "BP", annot_pkg = "org.Sc.sgd.db")
#' go
go %>% head() %>%
kable(caption = 'GO Terms enrichmenat analysis', digits = 3) %>%
kable_styling(full_width = F, font_size = 10,
latex_options = c("striped", "scale_down"))
| Cluster | ID | Description | Pvalue | Count | CountUniverse | Ontology |
|---|---|---|---|---|---|---|
| Cluster 4 (149 genes) | GO:0051336 | regulation of hydrolase activity | 0.0018 | 4 | 12 | BP |
| Cluster 4 (149 genes) | GO:0043085 | positive regulation of catalytic activity | 0.0044 | 4 | 15 | BP |
| Cluster 4 (149 genes) | GO:0035303 | regulation of dephosphorylation | 0.0068 | 2 | 3 | BP |
| Cluster 4 (149 genes) | GO:0046889 | positive regulation of lipid biosynthetic process | 0.0068 | 2 | 3 | BP |
| Cluster 4 (149 genes) | GO:1903727 | positive regulation of phospholipid metabolic process | 0.0068 | 2 | 3 | BP |
| Cluster 4 (149 genes) | GO:0044764 | multi-organism cellular process | 0.0074 | 3 | 9 | BP |
Some analysis are performed in terms of ions, i.e. feature, including PCA and correlation.
expl <- ExploratoryAnalysis(data = dat)
Figure 11: Exploratory analysis plots with respect to ionome
Figure 12: Exploratory analysis plots with respect to ionome
Figure 13: Exploratory analysis plots with respect to ionome
expl$plot.PCA_Individual
Figure 14: Exploratory analysis plots with respect to ionome
expl$plot.correlation_network
Figure 15: Exploratory analysis plots with respect to ionome